Abstract

Current distributed representations of words show little resemblance to
theories of lexical semantics. The former are dense and uninterpretable, the
latter largely based on familiar, discrete classes (e.g., supersenses) and
relations (e.g., synonymy and hypernymy). We propose methods that transform
word vectors into sparse (and optionally binary) vectors. The resulting
representations are more similar to the interpretable features typically used
in NLP, though they are discovered automatically from raw corpora. Because the
vectors are highly sparse, they are computationally easy to work with. Most
importantly, we find that they outperform the original vectors on benchmark
tasks.

1 Introduction 

Distributed representations of words have been shown to benefit NLP tasks like
parsing (Lazaridou et al., 2013; Bansal et al., 2014), named entity
recognition (Guo et al., 2014), and sentiment analysis (Socher et al., 2013).
The attraction of word vectors is that they can be derived directly from raw,
unannotated corpora. Intrinsic evaluations on various tasks are guiding
methods toward discovery of a representation that captures many facts about
lexical semantics (Turney, 2001; Turney and Pantel, 2010).

Yet word vectors do not look anything like the representations described in
most lexical semantic theories, which focus on identifying classes of words
(Levin, 1993; Baker et al., 1998; Schuler, 2005) and relationships among word
meanings (Miller, 1995). Though expensive to construct, conceptualizing word
meanings symbolically is important for theoretical understanding and also
when we incorporate lexical semantics into computational models where
interpretability is desired. On the surface, discrete theories seem
incommensurate with the distributed approach, a problem now receiving much
attention in computational linguistics (Lewis and Steedman, 2013; Kiela and
Clark, 2013; Vecchi et al., 2013; Grefenstette, 2013; Lewis and Steedman,
2014; Paperno et al., 2014).

Our contribution to this discussion is a new, principled sparse coding method
that transforms any distributed representation of words into sparse vectors,
which can then be transformed into binary vectors (§2). Unlike recent
approaches of incorporating semantics in distributional word vectors (Yu and
Dredze, 2014; Xu et al., 2014; Faruqui et al., 2015), the method does not rely
on any external information source. The transformation results in longer,
sparser vectors, sometimes called an "overcomplete" representation (Olshausen
and Field, 1997). Sparse, overcomplete representations have been motivated in
other domains as a way to increase separability and interpretability, with
each instance (here, a word) having a small number of active dimensions
(Olshausen and Field, 1997; Lewicki and Sejnowski, 2000), and to increase
stability in the presence of noise (Donoho et al., 2006).

Our work builds on recent explorations of sparsity as a useful form of
inductive bias in NLP and machine learning more broadly (Kazama and Tsujii,
2003; Goodman, 2004; Friedman et al., 2008; Glorot et al., 2011; Yogatama and
Smith, 2014, inter alia). Introducing sparsity in word vector dimensions has
been shown to improve dimension interpretability (Murphy et al., 2012; Fyshe
et al., 2014) and usability of word vectors as features in downstream tasks
(Guo et al., 2014). The word vectors we produce are more than 90% sparse; we
also consider binarizing transformations that bring them closer to the
categories and relations of lexical semantic theories. Using a number of
state-of-the-art word vectors as input, we find consistent benefits of our
method on a suite of standard benchmark evaluation tasks (§3). We also
evaluate our word vectors in a word intrusion experiment with humans (Chang et
al., 2009) and find that our sparse vectors are more interpretable than the
original vectors (§4).

We anticipate that sparse, binary vectors can play an important role as
features in statistical NLP models, which still rely predominantly on
discrete, sparse features whose interpretability enables error analysis and
continued development. We have made an implementation of our method publicly
available.^1

2 Sparse Overcomplete Word Vectors 

We consider methods for transforming dense word vectors to sparse, binary
overcomplete word vectors. Fig. 1 shows two approaches. The one on the top,
method A, converts dense vectors to sparse overcomplete vectors (§2.1). The
one beneath, method B, converts dense vectors to sparse and binary
overcomplete vectors (§2.2 and §2.4).

Let $V$ be the vocabulary size. In the following, $X \in \mathbb{R}^{L \times V}$
is the matrix constructed by stacking $V$ non-sparse "input" word vectors of
length $L$ (produced by an arbitrary word vector estimator). We will refer to
these as initializing vectors. $A \in \mathbb{R}^{K \times V}$ contains $V$
sparse overcomplete word vectors of length $K$. "Overcomplete" representation
learning implies that $K > L$.

2.1 Sparse Coding

In sparse coding (Lee et al., 2006), the goal is to represent each input
vector $x_i$ as a sparse linear combination of basis vectors, $a_i$. Our
experiments consider four initializing methods for these vectors, discussed in
Appendix A. Given $X$, we seek to solve

$$\operatorname*{arg\,min}_{D,\,A} \; \|X - DA\|_2^2 + \lambda\,\Omega(A) + \tau\|D\|_2^2 \tag{1}$$

where $D \in \mathbb{R}^{L \times K}$ is the dictionary of basis vectors,
$\lambda$ is a regularization hyperparameter, and $\Omega$ is the regularizer.
Here, we use the squared loss for the reconstruction error, but other loss
functions could also be used (Lee et al., 2009). To obtain sparse word
representations we impose an $\ell_1$ penalty on $A$. Eq. 1 can be broken down
into a loss for each word vector, which can be optimized separately in
parallel (§2.3):

$$\operatorname*{arg\,min}_{D,\,A} \; \sum_{i=1}^{V} \|x_i - D a_i\|_2^2 + \lambda\|a_i\|_1 + \tau\|D\|_2^2 \tag{2}$$

where $m_i$ denotes the $i$th column vector of matrix $M$. Note that this
problem is not convex. We refer to this approach as method A.

^1 https://github.com/mfaruqui/sparse-coding
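To make the notation concrete, the following minimal sketch evaluates the
objective of Eq. 2 for a given dictionary and set of codes using NumPy. It
assumes that the penalty on D is the squared Frobenius norm and that it is
counted once rather than once per word; the function name
`sparse_coding_objective` and the argument names are illustrative and are not
taken from the released implementation.

```python
import numpy as np

def sparse_coding_objective(X, D, A, lam, tau):
    """Value of the sparse coding objective (Eq. 2) for fixed D and A.

    X: (L, V) dense input vectors, one word per column.
    D: (L, K) dictionary of basis vectors.
    A: (K, V) sparse codes, one word per column.
    """
    reconstruction = np.sum((X - D @ A) ** 2)   # sum_i ||x_i - D a_i||_2^2
    l1_penalty = lam * np.sum(np.abs(A))        # sum_i lam * ||a_i||_1
    # Assumption: squared Frobenius norm of D, added once.
    l2_penalty = tau * np.sum(D ** 2)
    return float(reconstruction + l1_penalty + l2_penalty)
```

With `X` equal to `D @ A` the reconstruction term vanishes, so the value
reduces to the two penalty terms, which is a convenient sanity check.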

2.2 Sparse Nonnegative Vectors 

Nonnegativity in the feature space has often been shown to correspond to
interpretability (Lee and Seung, 1999; Cichocki et al., 2009; Murphy et al.,
2012; Fyshe et al., 2014; Fyshe et al., 2015). To obtain nonnegative sparse
word vectors, we use a variation of the nonnegative sparse coding method
(Hoyer, 2002). Nonnegative sparse coding further constrains the problem in
Eq. 2 so that $D$ and $a_i$ are nonnegative. Here, we apply this constraint
only to the representation vectors $\{a_i\}$. Thus, the new objective for
nonnegative sparse vectors becomes:

$$\operatorname*{arg\,min}_{D \in \mathbb{R}^{L \times K},\; A \in \mathbb{R}^{K \times V}_{\ge 0}} \; \sum_{i=1}^{V} \|x_i - D a_i\|_2^2 + \lambda\|a_i\|_1 + \tau\|D\|_2^2 \tag{3}$$

This problem will play a role in our second approach, method B, to which we
will return shortly. This nonnegativity constraint can be easily incorporated
during optimization, as explained next.

2.3 Optimization 

We use online adaptive gradient descent (AdaGrad; Duchi et al., 2010) for
solving the optimization problems in Eqs. 2-3 by updating $A$ and $D$. In
order to speed up training we use asynchronous updates to the parameters of
the model in parallel for every word vector (Duchi et al., 2012; Heigold et
al., 2014).

Figure 1: Methods for obtaining sparse overcomplete vectors (top, method A,
§2.1) and sparse, binary overcomplete word vectors (bottom, method B, §2.2 and
§2.4). Observed dense vectors of length L (left) are converted to sparse
non-negative vectors (center) of length K which are then projected into the
binary vector space (right), where $L \ll K$. X is dense, A is sparse, and B
is the binary word vector matrix. Strength of colors signifies the magnitude
of values; negative is red, positive is blue, and zero is white.

X        L      $\lambda$   $\tau$       K       % Sparse
Glove    300    1.0         $10^{-5}$    3000    91
SG       300    0.5         $10^{-5}$    3000    92
GC       50     1.0         $10^{-5}$    500     98
Multi    48     0.1         $10^{-5}$    960     93

Table 1: Hyperparameters for learning sparse overcomplete vectors tuned on the
WS-353 task. Tasks are explained in §B. The four initial vector
representations X are explained in §A.

However, directly applying stochastic subgradient descent to an
$\ell_1$-regularized objective fails to produce sparse solutions in bounded
time, which has motivated several specialized algorithms that target such
objectives. We use the AdaGrad variant of one such learning algorithm, the
regularized dual averaging algorithm (Xiao, 2009), which keeps track of the
online average gradient at time $t$:
$\bar{g}_t = \frac{1}{t}\sum_{t'=1}^{t} g_{t'}$. Here, the subgradients do not
include terms for the regularizer; they are derivatives of the unregularized
objective ($\lambda = 0$, $\tau = 0$) with respect to $a_i$. We define

$$\gamma = -\operatorname{sign}(\bar{g}_{t,i,j}) \, \frac{\eta t}{\sqrt{G_{t,i,j}}} \, \left(|\bar{g}_{t,i,j}| - \lambda\right),$$

where $G_{t,i,j} = \sum_{t'=1}^{t} g_{t',i,j}^2$. Now, using the average
gradient, the $\ell_1$-regularized objective is optimized as follows:

$$a_{t+1,i,j} = \begin{cases} 0, & \text{if } |\bar{g}_{t,i,j}| \le \lambda \\ \gamma, & \text{otherwise} \end{cases} \tag{4}$$

where $a_{t+1,i,j}$ is the $j$th element of sparse vector $a_i$ at the $t$th
update and $\bar{g}_{t,i,j}$ is the corresponding average gradient. For
obtaining nonnegative sparse vectors we take the projection of the updated
$a_i$ onto $\mathbb{R}^K_{\ge 0}$ by choosing the closest point in
$\mathbb{R}^K_{\ge 0}$ according to Euclidean distance (which corresponds to
zeroing out the negative elements):

$$a_{t+1,i,j} = \begin{cases} 0, & \text{if } |\bar{g}_{t,i,j}| \le \lambda \\ 0, & \text{if } \gamma < 0 \\ \gamma, & \text{otherwise} \end{cases} \tag{5}$$
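The closed-form updates of Eqs. 4 and 5 can be sketched per coordinate as
below. This is a minimal illustration that assumes the standard AdaGrad form
of regularized dual averaging, with step size `eta` scaled by the root of the
accumulated squared gradients; the function name `rda_l1_step`, the argument
names, and the small constant guarding against division by zero are
assumptions rather than details of the released code.

```python
import numpy as np

def rda_l1_step(grad_sum, grad_sq_sum, t, lam, eta, nonnegative=False):
    """One AdaGrad dual-averaging update for an l1-penalized code vector.

    grad_sum:    running sum of unregularized subgradients, per coordinate.
    grad_sq_sum: running sum of squared subgradients, per coordinate.
    t:           number of updates seen so far (t >= 1).
    """
    g_bar = grad_sum / t                                   # average gradient
    # Assumption: 1e-12 floor avoids dividing by zero for untouched coordinates.
    scale = eta * t / np.sqrt(np.maximum(grad_sq_sum, 1e-12))
    gamma = -np.sign(g_bar) * scale * (np.abs(g_bar) - lam)
    a = np.where(np.abs(g_bar) <= lam, 0.0, gamma)         # Eq. 4: exact zeros
    if nonnegative:
        a = np.maximum(a, 0.0)                             # Eq. 5: projection
    return a
```

Coordinates whose average gradient stays within the threshold are set exactly
to zero, which is what yields sparse codes in a finite number of updates.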


hot, fresh, fish, 1/2, wine, salt 
series, tv, appearances, episodes 
1975, 1976, 1968, 1970, 1977, 19W 

dress, shirt, ivory, shirts, pants 
upscale, affluent, catering, clientele 

Table 2: Highest frequency words in randomly picked word clusters of binary
sparse overcomplete Glove vectors.


2.4 Binarizing Transformation 

Our aim with method B is to obtain word representations that can emulate the
binary-feature space designed for various NLP tasks. We could state this as an
optimization problem:

$$\operatorname*{arg\,min}_{D \in \mathbb{R}^{L \times K},\; B \in \{0,1\}^{K \times V}} \; \sum_{i=1}^{V} \|x_i - D b_i\|_2^2 + \lambda\|b_i\|_1 + \tau\|D\|_2^2 \tag{6}$$

where $B$ denotes the binary (and also sparse) representation. This is a
mixed-integer bilinear program, which is NP-hard (Al-Khayyal and Falk, 1983).
Unfortunately, the number of variables in the problem is $\approx KV$, which
reaches 100 million when $V = 100{,}000$ and $K = 1{,}000$, which is
intractable to solve using standard techniques.

A more tractable relaxation to this hard problem is to first constrain the
continuous representation $A$ to be nonnegative (i.e.,
$a_i \in \mathbb{R}^K_{\ge 0}$; §2.2). Then, in order to avoid an expensive
computation, we take the nonnegative word vectors obtained using Eq. 3 and
project nonzero values to 1, preserving the 0 values. Table 2 shows a random
set of word clusters obtained by (i) applying our method to Glove initial
vectors and (ii) applying $k$-means clustering ($k = 100$). In §3 we will find
that these vectors perform well quantitatively.
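The projection step described above amounts to a single thresholding
operation. The sketch below shows it with NumPy, together with a helper that
reports the fraction of zero entries; the names `binarize` and `sparsity` are
illustrative only.

```python
import numpy as np

def binarize(A):
    """Map every nonzero entry of a nonnegative matrix to 1, keeping zeros."""
    return (A > 0).astype(np.int8)

def sparsity(M):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(M == 0))
```

Because zeros are preserved, the binary matrix has the same sparsity pattern
as the nonnegative codes it was derived from.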

2.5 Hyperparameter Tuning 

Methods A and B have three hyperparameters: the $\ell_1$-regularization
penalty $\lambda$, the $\ell_2$-regularization penalty $\tau$, and the length
of the overcomplete word vector representation $K$. We perform a grid search
on $\lambda \in \{0.1, 0.5, 1.0\}$ and $K \in \{10L, 20L\}$, selecting values
that maximize performance on one "development" word similarity task (WS-353,
discussed in §B) while achieving at least 90% sparsity in overcomplete
vectors. $\tau$ was tuned on one collection of initializing vectors (Glove,
discussed in §A) so that the vectors in $D$ are near unit norm. The four
vector representations and their corresponding hyperparameters selected by
this procedure are summarized in Table 1. These hyperparameters were chosen
for method A and retained for method B.
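The selection rule above, maximizing the development score subject to a
sparsity floor, can be sketched as a small constrained grid search. The
callback `evaluate`, which would train vectors for one setting and return a
(development score, sparsity) pair, is hypothetical, as is the function name
`select_hyperparameters`.

```python
from itertools import product

def select_hyperparameters(L, evaluate, lambdas=(0.1, 0.5, 1.0),
                           multipliers=(10, 20), min_sparsity=0.90):
    """Return the (lambda, K) pair with the best development score among
    settings whose vectors reach the required sparsity.

    evaluate(lam, K) is a hypothetical callback returning (score, sparsity).
    """
    best = None
    for lam, mult in product(lambdas, multipliers):
        K = mult * L
        score, sparsity = evaluate(lam, K)
        if sparsity < min_sparsity:
            continue                      # violates the sparsity constraint
        if best is None or score > best[0]:
            best = (score, lam, K)
    if best is None:
        raise ValueError("no setting reached the sparsity threshold")
    return best[1], best[2]
```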

3 Experiments 

Using methods A and B, we constructed sparse overcomplete vector
representations A and B, respectively, starting from four initial vector
representations X; these are explained in Appendix A. We used one benchmark
evaluation (WS-353) to tune hyperparameters, resulting in the settings shown
in Table 1; seven other tasks were used to evaluate the quality of the sparse
overcomplete representations. The first of these is a word similarity task,
where the score is correlation with human judgments, and the others are
classification accuracies of an $\ell_2$-regularized logistic regression model
trained using the word vectors. These tasks are described in detail in
Appendix B.

3.1 Effects of Transforming Vectors 

First, we quantify the effects of our transformations by comparing their
output to the initial (X) vectors. Table 3 shows consistent improvements from
sparsifying vectors (method A). The exceptions are on the SimLex task, where
our sparse vectors are worse than the skip-gram initializer and on par with
the multilingual initializer. Sparsification is beneficial across all of the
text classification tasks, for all initial vector representations. On average
across all vector types and all tasks, sparse overcomplete vectors outperform
their corresponding initializers by 4.2 points.^2

Binarized vectors (from method B) are also usually better than the initial
vectors (also shown in Table 3), and tend to outperform the sparsified
variants, except when initializing with Glove. On average across all vector
types and all tasks, binarized overcomplete vectors outperform their
corresponding initializers by 4.8 points and the continuous, sparse
intermediate vectors by 0.6 points.

From here on, we explore more deeply the sparse overcomplete vectors from
method A (denoted by A), leaving binarization (method B) aside.

3.2 Effect of Vector Length 

How does the length of the overcomplete vector ($K$) affect performance? We
focus here on the Glove vectors, where $L = 300$, and report average
performance across all tasks. We consider $K = \alpha L$ where
$\alpha \in \{1, 2, 3, 5, 10, 15, 20\}$. Figure 2 plots the average
performance across tasks against $\alpha$. The earlier selection of
$K = 3{,}000$ ($\alpha = 10$) gives the best result; gains are monotonic in
$\alpha$ to that point and then begin to diminish.

Vectors      SimLex   Senti.   TREC    Sports   Comp.   Relig.   NP      Average
             Corr.    Acc.     Acc.    Acc.     Acc.    Acc.     Acc.
Glove   X    36.9     77.7     76.2    95.9     79.7    86.7     77.9    76.2
        A    38.9     81.4     81.5    96.3     87.0    88.8     82.3    79.4
        B    39.7     81.0     81.2    95.7     84.6    87.4     81.6    78.7
SG      X    43.6     81.5     77.8    97.1     80.2    85.9     80.1    78.0
        A    41.7     82.7     81.2    98.2     84.5    86.5     81.6    79.4
        B    42.8     81.6     81.6    95.2     86.5    88.0     82.9    79.8
GC      X     9.7     68.3     64.6    75.1     60.5    76.0     79.4    61.9
        A    12.0     73.3     77.6    77.0     68.3    81.0     81.2    67.2
        B    18.7     73.6     79.2    79.7     70.5    79.6     79.4    68.6
Multi   X    28.7     75.5     63.8    83.6     64.3    81.8     79.2    68.1
        A    28.1     78.6     79.2    93.9     78.2    84.5     81.1    74.8
        B    28.7     77.6     82.0    94.7     81.4    85.6     81.9    75.9

Table 3: Performance comparison of transformed vectors to initial vectors X.
We show sparse overcomplete representations A and also binarized
representations B. Initial vectors are discussed in §A and tasks in §B.

Figure 2: Average performance across all tasks for sparse overcomplete vectors
(A) produced by Glove initial vectors, as a function of the ratio of K to L.

^2 We report correlation on a 100 point scale, so that the average, which
includes accuracies and correlation, is equally representative of both.

3.3 Alternative Transformations

We consider two alternative transformations. The first preserves the original
vector length but achieves a binary, sparse vector (B) by applying:

$$b_{ij} = \begin{cases} 1 & \text{if } x_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

The second transformation was proposed by Guo et al. (2014). Here, the
original vector length is also preserved, but sparsity is achieved through:

$$a_{ij} = \begin{cases} 1 & \text{if } x_{ij} \ge M^{+} \\ -1 & \text{if } x_{ij} \le M^{-} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $M^{+}$ ($M^{-}$) is the mean of positive-valued (negative-valued)
elements of $X$. These vectors are, obviously, not binary.

We find that on average, across initializing vectors and across all tasks, our
sparse overcomplete (A) vectors lead to better performance than either of the
alternative transformations (Table 4).
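Both baseline transformations are simple element-wise thresholds, as the
following sketch of Eqs. 7 and 8 illustrates. It assumes that the two means
are taken over all entries of X and that entries equal to a mean fall on the
nonzero side; the function names `binarize_by_sign` and `ternarize_by_mean`
are illustrative.

```python
import numpy as np

def binarize_by_sign(X):
    """Eq. 7: 1 where the entry is positive, 0 elsewhere."""
    return (X > 0).astype(np.int8)

def ternarize_by_mean(X):
    """Eq. 8 (after Guo et al., 2014): +1, -1 or 0 by comparison with the
    mean of the positive and the mean of the negative entries of X.

    Assumes X has at least one positive and one negative entry.
    """
    m_pos = X[X > 0].mean()
    m_neg = X[X < 0].mean()
    out = np.zeros(X.shape, dtype=np.int8)
    out[X >= m_pos] = 1
    out[X <= m_neg] = -1
    return out
```

Neither transformation changes the vector length, which is the main
difference from the overcomplete codes of method A.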

X:                         Glove   SG      GC      Multi   Average
X                          76.2    78.0    61.9    68.1    71.0
Eq. 7                      75.7    75.8    60.5    64.1    69.0
Eq. 8 (Guo et al., 2014)   75.8    76.9    60.5    66.2    69.8
A                          79.4    79.4    67.2    74.8    75.2

Table 4: Average performance across all tasks and vector models using
different transformations.

4 Interpretability

Our hypothesis is that the dimensions of sparse overcomplete vectors are more
interpretable than those of dense word vectors. Following Murphy et al.
(2012), we use a word intrusion experiment (Chang et al., 2009) to corroborate
this hypothesis. In addition, we conduct qualitative analysis of
interpretability, focusing on individual dimensions.

4.1 Word Intrusion

Word intrusion experiments seek to quantify the extent to which dimensions of
a learned word representation are coherent to humans. In one instance of the
experiment, a human judge is presented with five words in random order and
asked to select the "intruder." The words are selected by the experimenter by
choosing one dimension $j$ of the learned representation, then ranking the
words on that dimension alone. The dimensions are chosen in decreasing order
of the variance of their values across the vocabulary. Four of the words are
the top-ranked words according to $j$, and the "true" intruder is a word from
the bottom half of the list, chosen to be a word that appears in the top 10%
of some other dimension. An example of an instance is:

naval, industrial, technological, marine, identity

(The last word is the intruder.)

We formed instances from initializing vectors and from our sparse overcomplete
vectors (A). Each of these two combines the four different initializers X. We
selected 25 dimensions per initializer in each case. Each of the resulting 100
instances per condition (initial vs. sparse overcomplete) was given to three
judges.

Vectors   A1    A2    A3    Avg.   IAA   $\kappa$
X         61    53    56    57     70    0.40
A         71    70    72    71     77    0.45

Table 5: Accuracy of three human annotators on the word intrusion task, along
with the average inter-annotator agreement (Artstein and Poesio, 2008) and
Fleiss' $\kappa$ (Davies and Fleiss, 1982).

Results in Table 5 confirm that the sparse overcomplete vectors are more
interpretable than the dense vectors. Average annotator accuracy on the sparse
vectors increases substantially, from 57% to 71%; inter-annotator agreement
rises from 70% to 77%, and Fleiss' $\kappa$ increases from "fair" to
"moderate" agreement (Landis and Koch, 1977).
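The construction of a single word intrusion instance described in §4.1 can be
sketched as follows. This is an illustration under stated assumptions, not the
procedure's actual code: the representation is taken to be a K x V NumPy
matrix with one word per column, the intruder is drawn uniformly from the
eligible words, and the names `intrusion_instance`, `vocab` and `rng` are
hypothetical.

```python
import numpy as np

def intrusion_instance(A, vocab, j, rng):
    """Build one word-intrusion item for dimension j.

    A:     (K, V) matrix of word representations, one word per column.
    vocab: list of V words aligned with the columns of A.
    rng:   a numpy.random.Generator.
    Returns (five words in random order, the intruder word).
    """
    K, V = A.shape
    order = np.argsort(-A[j], kind="stable")      # words ranked on dimension j
    top_four = order[:4]
    bottom_half = order[V // 2:]

    # ranks[k, w] = position of word w when dimension k is sorted descending
    ranks = np.argsort(np.argsort(-A, axis=1, kind="stable"), axis=1)
    in_top_decile = ranks < max(1, V // 10)
    in_top_decile[j] = False                      # only other dimensions count
    eligible = [w for w in bottom_half if in_top_decile[:, w].any()]
    if not eligible:
        raise ValueError("no eligible intruder for this dimension")

    intruder = rng.choice(eligible)               # assumption: uniform choice
    words = [vocab[w] for w in top_four] + [vocab[intruder]]
    rng.shuffle(words)
    return words, vocab[intruder]
```

Requiring the intruder to rank highly on some other dimension keeps it from
being an obviously rare or uninformative word, so a judge can only spot it if
the four top-ranked words are themselves coherent.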

4.2 Qualitative Evaluation of Interpretability 

If a vector dimension is interpretable, the top-ranking words for that
dimension should display semantic or syntactic groupings. To verify this
qualitatively, we select five dimensions with the highest variance of values
in initial and sparsified GC vectors. We compare top-ranked words in the
dimensions extracted from the two representations. The words are listed in
Table 6, a dimension per row. Subjectively, we find the semantic groupings
better in the sparse vectors than in the initial vectors.

Figure 3 visualizes the sparsified GC vectors for six words. The dimensions
are sorted by the average value across the three "animal" vectors. The
animal-related words use many of the same dimensions (102 common active
dimensions out of 500 total); in contrast, the three city names use mostly
distinct vectors.

X    combat, guard, honor, bow, trim, naval
     'll, could, faced, lacking, seriously, scored
     see, n't, recommended, depending, part
     due, positive, equal, focus, respect, better
     sergeant, comments, critics, she, videos

A    fracture, breathing, wound, tissue, relief
     relationships, connections, identity, relations
     files, bills, titles, collections, poems, songs
     naval, industrial, technological, marine
     stadium, belt, championship, toll, ride, coach

Table 6: Top-ranked words per dimension for initial and sparsified GC
representations. Each line shows words from a different dimension.

5 Related Work 

To the best of our knowledge, there has been no prior work on obtaining
overcomplete word vector representations that are sparse and categorical.
However, overcomplete features have been widely used in image processing,
computer vision (Olshausen and Field, 1997; Lewicki and Sejnowski, 2000) and
signal processing (Donoho et al., 2006). Nonnegative matrix factorization is
often used for interpretable coding of information (Lee and Seung, 1999; Liu
et al., 2003; Cichocki et al., 2009).

Sparsity constraints are in general useful in NLP problems (Kazama and Tsujii,
2003; Friedman et al., 2008; Goodman, 2004), like POS tagging (Ganchev et al.,
2009), dependency parsing (Martins et al., 2011), text classification
(Yogatama and Smith, 2014), and representation learning (Bengio et al., 2013;
Yogatama et al., 2015). Including sparsity constraints in Bayesian models of
lexical semantics like LDA in the form of sparse Dirichlet priors has been
shown to be useful for downstream tasks like POS-tagging (Toutanova and
Johnson, 2007), and improving interpretation (Paul and Dredze, 2012; Zhu and
Xing, 2012).









Figure 3: Visualization of sparsified GC vectors for six words (rows: fish,
horse, dog, Chicago, Seattle, Boston). Negative values are red, positive
values are blue, zeroes are white.


6 Conclusion 

We have presented a method that converts word vectors obtained using any
state-of-the-art word vector model into sparse and optionally binary word
vectors. These transformed vectors appear to come closer to features used in
NLP tasks and outperform the original vectors from which they are derived on a
suite of semantic and syntactic evaluation benchmarks. We also find that the
sparse vectors are more interpretable than the dense vectors by humans
according to a word intrusion detection test.

A Initial Vector Representations (X)

Our experiments consider four publicly available collections of pre-trained
word vectors. They vary in the amount of data used and the estimation method.

Glove. Global vectors for word representations (Pennington et al., 2014) are
trained on aggregated global word-word co-occurrence statistics from a corpus.
These vectors were trained on 6 billion words from Wikipedia and English
Gigaword and are of length 300.^3

^3 http://www-nlp.stanford.edu/projects/glove/

Skip-Gram (SG). The word2vec tool (Mikolov et al., 2013) is fast and widely
used. In this model, each word's Huffman code is used as an input to a
log-linear classifier with a continuous projection layer and words within a
given context window are predicted. These vectors were trained on 100 billion
words of Google news data and are of length 300.^4

Global Context (GC). These vectors are learned using a recursive neural
network that incorporates both local and global (document-level) context
features (Huang et al., 2012). These vectors were trained on the first 1
billion words of English Wikipedia and are of length 50.^5

Multilingual (Multi). Faruqui and Dyer (2014) learned vectors by first
performing SVD on text in different languages, then applying canonical
correlation analysis on pairs of vectors for words that align in parallel
corpora. These vectors were trained on WMT-2011 news corpus containing 360
million words and are of length 48.^6

B Evaluation Benchmarks 

Our comparisons of word vector quality consider five benchmark tasks. We now
describe the different evaluation benchmarks for word vectors.

Word Similarity. We evaluate our word representations on two word similarity
tasks. The first is the WS-353 dataset (Finkelstein et al., 2001), which
contains 353 pairs of English words that have been assigned similarity ratings
by humans. This dataset is used to tune sparse vector learning hyperparameters
(§2.5), while the remaining tasks discussed in this section are completely
held out.

^4 https://code.google.com/p/word2vec
^5 http://nlp.stanford.edu/~socherr/ACL2012_wordVectorsTextFile.zip
^6 http://cs.cmu.edu/~mfaruqui/soft.html

A more recent dataset, SimLex-999 (Hill et al., 2014), has been constructed to
specifically focus on similarity (rather than relatedness). It contains a
balanced set of noun, verb, and adjective pairs. We calculate cosine
similarity between the vectors of two words forming a test item and report
Spearman's rank correlation coefficient (Myers and Well, 1995) between the
rankings produced by our model against the human rankings.
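The word similarity evaluation described above can be sketched in a few lines
of NumPy: cosine similarity for each word pair, followed by Spearman's rank
correlation, computed here as the Pearson correlation of average ranks. The
dictionary `vectors`, the list `pairs` and all function names are illustrative
assumptions; pairs containing an all-zero vector would need separate handling.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def similarity_score(vectors, pairs, human_ratings):
    """Correlate model cosine similarities with human ratings."""
    model = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return spearman(model, human_ratings)
```

Because only the rankings are compared, the score is unaffected by any
monotone rescaling of either the cosine values or the human ratings.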

Sentiment Analysis (Senti). Socher et al. (2013) created a treebank of
sentences annotated with fine-grained sentiment labels on phrases and
sentences from movie review excerpts. The coarse-grained treebank of positive
and negative classes has been split into training, development, and test
datasets containing 6,920, 872, and 1,821 sentences, respectively. We use the
average of the word vectors of a given sentence as features for
classification. The classifier is tuned on the dev. set and accuracy is
reported on the test set.

Question Classification (TREC). As an aid to question answering, a question
may be classified as belonging to one of many question types. The TREC
questions dataset involves six different question types, e.g., whether the
question is about a location, about a person, or about some numeric
information (Li and Roth, 2002). The training dataset consists of 5,452
labeled questions, and the test dataset consists of 500 questions. The average
of the word vectors of the input question is used as features and accuracy is
reported on the test set.
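The sentence and question features used in these classification tasks are
plain averages of word vectors. A minimal sketch follows; skipping
out-of-vocabulary tokens and returning a zero vector when no token is known
are assumptions made here for completeness, and the name `sentence_features`
is illustrative.

```python
import numpy as np

def sentence_features(tokens, vectors, dim):
    """Average the vectors of the tokens that have one.

    tokens:  list of word strings.
    vectors: dict mapping a word to a 1-D array of length dim.
    Assumption: unknown tokens are skipped; an all-unknown input yields zeros.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)
```

The resulting fixed-length vector can be passed directly to a logistic
regression classifier, whatever the length of the input text.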

20 Newsgroup Dataset. We consider three binary categorization tasks from the
20 Newsgroups dataset.^7 Each task involves categorizing a document according
to two related categories with training/dev./test split in accordance with
Yogatama and Smith (2014): (1) Sports: baseball vs. hockey (958/239/796);
(2) Comp.: IBM vs. Mac (929/239/777); (3) Religion: atheism vs. Christian
(870/209/717). We use the average of the word vectors of a given sentence as
features. The classifier is tuned on the dev. set and accuracy is reported on
the test set.

^7 http://qwone.com/~jason/20Newsgroups

NP bracketing (NP). Lazaridou et al. (2013) constructed a dataset from the
Penn Treebank (Marcus et al., 1993) of noun phrases (NP) of length three
words, where the first can be an adjective or a noun and the other two are
nouns. The task is to predict the correct bracketing in the parse tree for a
given noun phrase. For example, local (phone company) and (blood pressure)
medicine exhibit right and left bracketing, respectively. We append the word
vectors of the three words in the NP in order and use them as features for
binary classification. The dataset contains 2,227 noun phrases split into 10
folds. The classifier is tuned on the first fold and cross-validation accuracy
is reported on the remaining nine folds.
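In contrast to the averaged features of the other classification tasks, the NP
bracketing features preserve word order by concatenation. A minimal sketch,
with the illustrative name `np_features` and a hypothetical `vectors`
dictionary mapping each word to its vector:

```python
import numpy as np

def np_features(words, vectors):
    """Concatenate the vectors of a three-word noun phrase, in order."""
    if len(words) != 3:
        raise ValueError("expected a three-word noun phrase")
    return np.concatenate([vectors[w] for w in words])
```

For vectors of length K the feature vector has length 3K, so each position in
the phrase keeps its own block of dimensions.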